ABSTRACT


The first intelligent tutoring systems for computer
programming were proposed more than 30 years ago, mostly
focusing on well-defined programming tasks, e.g. in the context
of logic programming. Recent systems also teach complex
programs, where explicit modelling of every possible pro- 
gram and mistake is no longer possible. Such systems are 
based on data-driven approaches, which focus on the syn- 
tax of a program or consider the output for example cases. 
However, the system’s understanding of student programs 
could be enriched by a deeper focus on the actual execution 
of a program. This requires a suitable data representation 
which encodes information about a program’s style as well as
its functionality, thus offering entry points
for automated feedback generation. 


In this contribution we propose a representation of com- 
puter programs via execution traces for example input and 
demonstrate the power of this representation in three key 
challenges for intelligent tutoring systems: identifying the 
underlying solution strategy, identifying erroneous solutions 
and locating the errors in erroneous programs for feedback 
display. 
1. INTRODUCTION


Teaching computer programming has been a long-standing 
goal of intelligent tutoring systems research. The earliest 
example, the LISP tutor, was released in 1985 [1] and
since then many different approaches have evolved, such as 
learning by examining and manipulating examples, by sim- 
ulation and debugging, by dialogue with the system, by col- 
laboration with peers or by feedback [7]. Most of these ap- 
proaches rely on extensive domain knowledge about program 
structure, typical mistakes (so-called buggy rules) and
syntactic concepts, which is expensive to obtain and difficult to
encode [5, 10]. In particular, such approaches become infeasible if
the space of possible programs (and mistakes) becomes too large,
and if the goal of the computer program is ill-defined [8]. To
push the boundaries of intelligent tutoring systems towards 
such scenarios, data-driven approaches have been developed 
which provide feedback to students based on example pro- 
grams handed in by other students, e.g. by highlighting the 
difference of the student solution and a similar, correct pro- 
gram [2, 16]. However, such approaches focus strongly on 
the syntax of programs, which is problematic because the 
relation between a program’s functionality and its syntax is
highly non-linear. 


As an example, consider the Java code shown in Figure 1.
The programs on the left and in the middle are both (correct)
sorting programs, which have a very similar syntactic
structure. Both sort the array via two nested loops, compare
the current element to its successor and swap them if
the order is incorrect. However, the programs implement
different algorithms, namely BubbleSort (left) and
InsertionSort (middle). Thus, minor syntactic changes correspond
to major changes in terms of function [14]. If an intelligent
tutoring system provides feedback based on a functionally
dissimilar example (e.g. a different underlying algorithm),
the system might recommend changes to the student’s program
which lead the learner away from her intended strategy.
Such feedback might be detrimental to the student’s
learning success.


This poses a challenge to educational datamining research. 
How do we estimate the similarity between programs on a
functional level without excessive effort in knowledge
engineering? We propose to represent computer programs by
their execution traces, to compare such traces using sequence 
alignment and to define the similarity between programs 
based on the alignment distance. An execution trace is a 
sequence of variable states for each step of the program’s 
execution for some input. Traces are a common representation of
computer programs for debugging purposes and can provide 
insight into the dynamic behaviour of programs [6]. In par- 
ticular, traces and alignments of traces have been success- 
fully applied in educational programming environments to 
offer students an alternative view on their own program for 
self-reflection [17, 18]. We build upon this research by uti- 
lizing the trace representation for educational datamining, 
[Figure 1, listings not reproduced. Left: bubblesort(int[] A). Middle: insertionSort(int[] A), iterative. Right: insertionSort(int[] A), recursive, with the helper methods insertionSort(int[] A, int l, int r) and insert(int[] A, int l, int r).]


Figure 1: Three correct sorting programs in Java code. Important syntactic constructs and variable ini- 
tializations are highlighted. The corresponding code parts between all three programs are visualized via 
background highlighting. Left: An iterative BubbleSort implementation. Middle: An iterative InsertionSort 
implementation. Right: A recursive InsertionSort implementation. 


Bubble         Insertion      recursive
[4, 7, 2, 1]   [4, 7, 2, 1]   [4, 7, 2, 1]
[4, 2, 7, 1]   [4, 2, 7, 1]   [4, 2, 7, 1]
[4, 2, 1, 7]   [2, 4, 7, 1]   [2, 4, 7, 1]
[2, 4, 1, 7]   [2, 4, 1, 7]   [2, 4, 1, 7]
[2, 1, 4, 7]   [2, 1, 4, 7]   [2, 1, 4, 7]
[1, 2, 4, 7]   [1, 2, 4, 7]   [1, 2, 4, 7]


Table 1: The execution traces for the three programs
from Figure 1 for the input array A = [4, 7, 2, 1].
Only the values for the variable A are shown and
intermediate steps that do not manipulate A have
been omitted.


that is, for automated classification and analysis of students’
computer programs in order to provide helpful, automated 
feedback. 


As an example, consider the programs from Figure 1 again.
Their execution traces for the input array A = [4, 7, 2, 1] are
shown in Table 1. Despite the apparent syntactic similarity,
the implementations of BubbleSort and InsertionSort do
indeed map to different traces, while the iterative and
recursive implementations of InsertionSort map to the same trace.
This indicates that traces have a more direct relationship to
the semantics of the underlying program, making them a
promising representation for intelligent tutoring systems.
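
The traces in Table 1 can be reproduced with a few lines of instrumented code. The following Python sketch is not the Java code from Figure 1: the function names are hypothetical, and recording the array after every swap is an assumption chosen to match the "only steps that manipulate A" convention of Table 1.

```python
def bubble_sort_trace(values):
    """Iterative BubbleSort; returns the array state after every swap."""
    A = list(values)
    trace = [list(A)]  # initial state
    for i in range(len(A) - 1):
        for j in range(len(A) - 1 - i):
            if A[j] > A[j + 1]:
                A[j], A[j + 1] = A[j + 1], A[j]
                trace.append(list(A))
    return trace


def insertion_sort_trace(values):
    """Iterative InsertionSort; returns the array state after every swap."""
    A = list(values)
    trace = [list(A)]
    for i in range(1, len(A)):
        j = i
        while j > 0 and A[j - 1] > A[j]:
            A[j - 1], A[j] = A[j], A[j - 1]
            trace.append(list(A))
            j -= 1
    return trace


def recursive_insertion_sort_trace(values):
    """Recursive InsertionSort; returns the array state after every swap."""
    A = list(values)
    trace = [list(A)]

    def insert(r):
        # Move A[r] to the left while it is smaller than its predecessor.
        if r > 0 and A[r - 1] > A[r]:
            A[r - 1], A[r] = A[r], A[r - 1]
            trace.append(list(A))
            insert(r - 1)

    def sort(r):
        # Sort A[0..r-1] first, then insert A[r] into the sorted prefix.
        if r > 0:
            sort(r - 1)
            insert(r)

    sort(len(A) - 1)
    return trace
```

For the input [4, 7, 2, 1], the two InsertionSort variants produce identical traces, while the BubbleSort trace differs in its third state, as in Table 1.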


The main contributions of our work are as follows: First, we 
introduce execution traces in order to capture syntactic
as well as semantic aspects of the underlying program
(Section 3). Second, we provide an efficient methodology 
for automatically comparing such traces via edit distances 
and inferring a measure of similarity for further datamin- 
ing applications (Section 4). Finally, we evaluate our ap- 
proach in comparison with the state of the art in syntactic 
representation in three key challenges for educational data 
mining: 1.) identifying the student’s underlying algorith- 
mic approach (Section 5.2), 2.) identifying erroneous im- 
plementations (Section 5.3), and 3.) detecting the location 
of errors for feedback (Section 5.4). To our knowledge, no 
data-driven approach exists to date which tackles all three
challenges. Syntax-based representations have been success- 
ful in identifying the programming strategy [11, 13] but fail 
in identifying erroneous solutions as well as error locations 
(as we will show later). On the other hand, test case-based 
evaluations are very successful in identifying erroneous solu- 
tions but treat programs as a black box and thus can make 
no claims regarding the implemented strategy or the loca- 
tion of the error [17]. 


2. BACKGROUND AND RELATED WORK 


2.1 Tutoring Systems for Computer Programming

In a review of AI-supported tutoring approaches for computer
programming, Le and colleagues found six categories
of approaches, namely: 1.) displaying examples of programs 
in order to learn to construct programs of a similar type 
or modify examples; 2.) simulating the execution of a pro- 
gram in a micro-world and visualizing it to the user; 3.) 
providing a dialogue environment in order to complete a 
programming task in an interactive dialogue with the sys- 
tem; 4.) presenting buggy example code in order to learn 
via program analysis and debugging; 5.) providing feedback 
to students during development of their program in order 
to guide them towards a correct solution and detect errors; 
and finally 6.) providing a collaborative work environment 
in which students can help each other in developing a pro- 
gram, guided by the system’s group model [7]. We note that 
Le and colleagues do not yet consider recent data-driven ap- 
proaches, which are mostly feedback-based systems, such as 
the FIT Java Tutor [2], BOTS [4] and ITAP [16]. Our own 
approach is targeted mainly at such feedback-based systems 
working on examples. We analyze the execution trace of a 
student’s program in order to find similar programs for feed- 
back purposes and we intend to locate errors in the student’s 
program to help her correct them. However, our approach 
also bears similarity to simulation-based approaches as we 
consider the execution of the program’s statements as the 
main characteristic of a program. 
2.2 Representations of Computer Programs for Data-Driven Systems

Most existing data-driven systems for computer programming
represent programs as abstract syntax trees, which
are subjected to some form of canonicalization in order to
abstract from mere stylistic differences [15]. Recently, Piech
and colleagues have criticized this approach and judged syntax
trees not sufficiently discriminative to capture the strong
functional consequences of small syntactic changes [14].
Instead, they propose a neural network-based approach to infer
a vectorial representation of programs, such that standard
machine learning methods can be applied in the resulting
Euclidean space. Similar to our approach, Piech and colleagues
intend to represent a program’s function (or semantics)
as opposed to its syntax. However, they focus on a
direct mapping between input and output of program segments,
while the trace representation provides more procedural
(or dynamic) insight into the program’s function.


2.3 Edit Distances on Computer Programs 
Computing similarities and dissimilarities between computer
programs is a crucial step towards data-driven intelligent
tutoring systems [9]. Edit distances have been particularly
prominent in this regard. For example, Rivers and Koedinger
used tree edit distances to compute similarities between syntax
trees of Python programs to identify adjacent states [16].
Gross and colleagues similarly applied edit distances on syntax
trees to infer clusters of computer programs and select
the most similar sample solution for feedback [2, 3]. Finally,
Paaßen, Mokbel and Hammer have identified the underlying
algorithm of sorting programs using machine learning
techniques based on alignment distances and adapted the
parameters of those alignment distances to yield better
classification results [11, 13]. Note that all these approaches rely
on alignment distances on program syntax, not on execution
traces. Striewe and Goedicke applied sequence alignment on
execution traces, but did not apply the alignment distances
for further datamining purposes [18].


2.4 Classification of Computer Programs 
Recently, the value of classification methods for feedback 
provision in intelligent tutoring systems for computer pro- 
gramming has been recognized. Such machine learning meth- 
ods enable tutoring systems to infer e.g. the underlying pro- 
gramming strategy of a learner with explicit human labelling 
only for a small example set [13]. Piech and colleagues report 
multiplication factors of up to 214, that is, a human tutor’s
annotation for one program permits inference of said anno- 
tation for up to 214 other programs [14]. Of course, such 
approaches rely on a representation of computer programs 
in a format that can be fed into machine learning methods, 
such as pairwise similarities and dissimilarities [9, 13] or an 
explicit vectorial embedding [14]. In this contribution, we 
employ a classification paradigm to distinguish between dif- 
ferent algorithmic approaches, as well as between erroneous 
and correct solutions. 


3. REPRESENTING COMPUTER PROGRAMS VIA EXECUTION TRACES


In general, execution trace recordings can be defined as the 
“detection and storage of relevant events during run-time, 
for later off-line analysis” [6]. More specifically, we consider 
executions of statements in the program as relevant events,
which we characterize by the value of variables of interest
after the statement has been executed. This is equivalent to
a step-wise execution of the program in a debugger, where we
record the state of the variables of interest in each step [17].
As an example, consider the traces in Table 1 for the programs
in Figure 1.
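
The debugger-style recording described above can be sketched in Python with the standard tracing hook `sys.settrace`. This is an analogous illustration, not the Java tooling behind the traces in this paper; `record_trace`, `watch` and the example `bubble_sort` are hypothetical names. A `line` event fires before a line is executed, so each snapshot shows the state left behind by the previously executed statement, and the `return` event captures the state after the last one.

```python
import copy
import sys


def record_trace(func, args, watch):
    """Run func(*args) and record the watched local variables at every step."""
    trace = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only follow the frame of the function under observation.
        if frame.f_code is not code:
            return None
        if event in ("line", "return"):
            # Deep copies keep earlier snapshots independent of later changes.
            trace.append({name: copy.deepcopy(frame.f_locals[name])
                          for name in watch if name in frame.f_locals})
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the previous hook
    return result, trace


def bubble_sort(A):
    """Plain in-place BubbleSort, used here as the traced student program."""
    for i in range(len(A) - 1):
        for j in range(len(A) - 1 - i):
            if A[j] > A[j + 1]:
                A[j], A[j + 1] = A[j + 1], A[j]
    return A
```

A call such as `record_trace(bubble_sort, ([4, 7, 2, 1],), watch=("A",))` yields one state per executed step, including many repetitions of the same array. Dropping consecutive repetitions leaves the six states of the BubbleSort column in Table 1.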


Only modest technical requirements have to be fulfilled to 
apply a trace representation. 1.) The programming lan- 
guage has to offer a debugging environment which permits 
monitoring of a program’s execution; 2.) a valid and non- 
trivial example input for the task has to be available; and 3.) 
the student’s program has to compile and execute without 
errors on the example input [17]. Thus, the trace repre- 
sentation is more demanding compared to the very flexible 
syntactic representation of computer programs, but has fewer
prerequisites compared to extensive knowledge engineering. 
In that sense, the trace representation can be seen as a “mid- 
dle road” between entirely data-driven approaches and sys- 
tems based on expert knowledge. 


4. COMPARING EXECUTION TRACES 


If a student’s program is analyzed via test cases, the output 
is compared with the pre-defined reference value via a simple 
equality test. However, such a strict equality test is not a 
viable option for the comparison of execution traces. For 
example, the traces on the left and the middle in Table 1 
are not equal. But they are more similar to each other than 
to an erroneous program that does not sort the input array 
at all. Therefore, we require a more flexible measure of 
similarity or dissimilarity between traces [9]. 


Similarities and dissimilarities on sequential data can be
obtained via alignment distances or edit distances. The
overarching scheme is to extend both input sequences such that
their length becomes equal and similar elements of both
sequences become aligned. The alignment distance is then
defined as the summed cost over all aligned elements [13]. The
choice of alignment algorithm depends on the extensions of
input sequences that should be permitted. In case of
execution traces we intend to abstract from sequence elements
that leave the relevant variables unchanged. As an example,
consider lines two and three of the program in Figure 1
(left). These two lines could be removed from the program
without changing its function, if all expressions of r and l
are replaced by their value in the rest of the program. A
classic edit distance scheme would punish this with a higher
dissimilarity between the shorter and the longer version of
the program. Instead, we propose that the same state of the
relevant variables may be copied without cost. This
corresponds to the dynamic time warping dissimilarity $D_{DTW}$ for
speech processing, first introduced by Vintsyuk [20]. Given
two traces $\bar{x} = (x_1, \ldots, x_M)$ and $\bar{y} = (y_1, \ldots, y_N)$ as well as
a dissimilarity measure $d(x_i, y_j)$ between the variable states
Figure 2: An illustration of the dynamic time
warping distance between two traces. Aligned ar- 
ray states are connected by yellow background. 
Mismatching parts of the aligned variable states 
are highlighted in red. The dissimilarity between 
aligned array states is shown in the middle. 


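$x_i$ and $y_j$, it is defined recursively as:

$$D_{DTW}\big((x_1, \ldots, x_i), (y_1, \ldots, y_j)\big) := d(x_i, y_j) + \min\big\{
    D_{DTW}\big((x_1, \ldots, x_{i-1}), (y_1, \ldots, y_{j-1})\big),
    D_{DTW}\big((x_1, \ldots, x_{i-1}), (y_1, \ldots, y_j)\big),
    D_{DTW}\big((x_1, \ldots, x_i), (y_1, \ldots, y_{j-1})\big) \big\} \quad (1)$$

$$D_{DTW}\big((x_1), (y_1)\big) := d(x_1, y_1) \quad (2)$$

This can be calculated efficiently in $O(M \cdot N)$ via dynamic
programming ($D_{DTW}$ is tabulated for all prefixes of $\bar{x}$ and
$\bar{y}$).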
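
The tabulation can be sketched as follows. This is a minimal Python illustration of the recursion, not the TCS Alignment Toolbox implementation used in the experiments. The function names are hypothetical, and the extra row and column of infinite cost (with a zero in the corner) are an implementation convenience that reproduces the base case for the first pair of states.

```python
def hamming(a, b):
    """Number of positions at which two equally long arrays differ."""
    return sum(1 for u, v in zip(a, b) if u != v)


def dtw(xs, ys, d=hamming):
    """Dynamic time warping dissimilarity between two traces xs and ys."""
    M, N = len(xs), len(ys)
    INF = float("inf")
    # D[i][j] holds the dissimilarity of the prefixes xs[:i] and ys[:j].
    D = [[INF] * (N + 1) for _ in range(M + 1)]
    D[0][0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            D[i][j] = d(xs[i - 1], ys[j - 1]) + min(
                D[i - 1][j - 1],  # advance in both traces
                D[i - 1][j],      # repeat the state of ys
                D[i][j - 1],      # repeat the state of xs
            )
    return D[M][N]
```

Applied to the BubbleSort and InsertionSort columns of Table 1 with the Hamming distance, this gives a dissimilarity of 4, whereas a trace that merely repeats states of another trace has dissimilarity 0 to it.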


An illustration of the dynamic time warping dissimilarity 
between two example traces is shown in Figure 2. The first 
three array states of the left trace are just repetitions and 
thus are aligned with the first array state of the right trace. 
This occurs again for the fourth to sixth array state of the 
left trace. Only afterwards the array states differ and lead 
to a non-zero dissimilarity between both traces. Note that 
the explicit alignment of array states between two compared 
traces in dynamic time warping can be retrieved efficiently 
via backtracing in linear time. 
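
A sketch of this backtracing step, under the same assumptions as before (hypothetical function name, sentinel row and column of infinite cost): after the table has been filled, the alignment is read off by walking from the last cell back to the first, always moving to the cheapest predecessor. The walk visits at most M + N cells.

```python
def dtw_alignment(xs, ys, d):
    """Return the DTW dissimilarity and the aligned index pairs (i, j)."""
    M, N = len(xs), len(ys)
    INF = float("inf")
    D = [[INF] * (N + 1) for _ in range(M + 1)]
    D[0][0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            D[i][j] = d(xs[i - 1], ys[j - 1]) + min(
                D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])

    # Backtracing: follow the cheapest predecessor from the end to the start.
    path = []
    i, j = M, N
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))  # zero-based indices into xs and ys
        _, i, j = min(
            (D[i - 1][j - 1], i - 1, j - 1),
            (D[i - 1][j], i - 1, j),
            (D[i][j - 1], i, j - 1),
        )
    path.reverse()
    return D[M][N], path
```

For a trace that repeats its first state three times before changing, aligned against a trace without repetitions, the path maps all three repetitions to the same state of the shorter trace.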


Like other edit distances, the dynamic time warping algorithm
crucially relies on a dissimilarity measure between variable 
states. If prior knowledge regarding the interesting variables 
is available, defining such a measure becomes fairly straight- 
forward (e.g. a Hamming-distance on arrays, just counting 
the number of unequal entries). In absence of such prior 
knowledge, defining a dissimilarity on variable states be- 
comes a challenge in itself. One has to infer a semantic 
matching between the variables in both programs, deter- 
mine their relevance (as some variables might be less central 
to the semantic function than others) and then compute the 
relevance-weighted distance between all matched variables. 
As a first step in this direction, we propose a simple
summary scheme. We build a histogram $H_{x_i}$ in each state $x_i$
that counts the number of variables of each type $t \in T$, and
compare these histograms with a normalized L1 distance:


$$d(x_i, y_j) := \frac{1}{|T|} \sum_{t \in T} \frac{|H_{x_i}(t) - H_{y_j}(t)|}{\max\{H_{x_i}(t), H_{y_j}(t)\}}$$
Note that we consider only types $t$ which occur in both
programs at least once.
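
A possible implementation of such a histogram comparison is sketched below in Python. The normalization is an assumption: each per-type difference is divided by the larger of the two counts and the result is averaged over the considered types, with types that are absent from both states contributing zero. The function names and the representation of a state as a dictionary from variable names to values are hypothetical.

```python
from collections import Counter


def type_histogram(state):
    """Count the variables of each type in a state (variable name -> value)."""
    return Counter(type(value).__name__ for value in state.values())


def histogram_dissimilarity(x, y, types):
    """Normalized L1 distance between the type histograms of two states.

    `types` lists the type names to consider, e.g. those that occur in
    both programs at least once.
    """
    if not types:
        return 0.0
    hx, hy = type_histogram(x), type_histogram(y)
    total = 0.0
    for t in types:
        largest = max(hx[t], hy[t])
        if largest > 0:  # assumption: a type missing in both states costs nothing
            total += abs(hx[t] - hy[t]) / largest
    return total / len(types)
```

For example, a state with one list and two integers compared to a state with one list and one integer yields (0 + 1/2) / 2 = 0.25.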


5. EXPERIMENTS 


Our experimental evaluation concerns three key challenges 
for data-driven intelligent tutoring systems: 1.) Identifying 
the underlying algorithmic approach, 2.) identifying erro- 
neous programs, and 3.) detecting the location of an error, 
once a program is identified as erroneous. We compare the 
performance on these tasks between the trace representa- 
tion (with dynamic time warping as dissimilarity measure) 
and the state-of-the-art in terms of syntax representation: 
syntax-trees with learned edit distance parameters via ma- 
chine learning techniques [13]. As implementation of the 
alignment techniques we applied the TCS Alignment Tool- 
box [12]. 


5.1 Datasets 

For our evaluation, we use two benchmark datasets. The 
palindrome data set consists of 48 (correct) programs decid- 
ing whether all words in an input sentence are palindromic, 
using one of eight different programming strategies [9]. We 
used the histogram-approach to define a dissimilarity be- 
tween variable states and generated traces using the input 
sentence “OTTO ANNA MOPS”. As this data set does not 
contain erroneous programs, we only used it for the first 
experiment. 


The second dataset is an extended version of the sorting 
dataset from [11]. It consists of 126 (correct) sorting pro- 
grams collected from various web sources, each implement- 
ing one of six sorting algorithms (35 BubbleSorts, 29 Inser- 
tionSorts, 15 MergeSorts, 17 QuickSorts, 20 SelectionSorts 
and 10 ShellSorts). For each of the programs we created an 
erroneous counterpart, with one or more semantic errors, 
that is, errors that are neither detected by the compiler nor 
do they lead to a program crash (e.g. due to an index being 
out of bounds). Thereby, we focused on errors that are non- 
trivial to detect for technical systems. As a dissimilarity 
between variable states we employed a Hamming distance 
on the array to be sorted. As input we generated a uniform 
random array of 10 integers in the range [0, 99]. 


Both datasets are available online at
http://doi.org/10.4119/unibi/2900666 and
http://doi.org/10.4119/unibi/2900684 respectively.


5.2 Classifying Programming Strategies 

Our first experiment concerns the identification of the un- 
derlying sorting algorithm. We assume that a human expert 
has already labelled some example programs and want to in- 
fer the correct label for some new, unlabelled program. We 
evaluate the classification accuracy of a 1-nearest neighbor
classifier for the syntactic as well as the trace-based represen- 
tation in a crossvalidation with 6 folds (for the palindrome 
dataset) and 10 folds (for the sorting dataset) respectively. 
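
The evaluation protocol can be sketched as follows, assuming that the pairwise dissimilarities between all programs have already been computed and stored in a matrix D. The function names are hypothetical, and the interleaved fold assignment is an assumption made for brevity; the fold construction used in the experiments is not specified here.

```python
import statistics


def one_nn_predict(D, labels, train_idx, test_idx):
    """Label each test point with the label of its nearest training point."""
    predictions = []
    for i in test_idx:
        nearest = min(train_idx, key=lambda j: D[i][j])
        predictions.append(labels[nearest])
    return predictions


def crossvalidated_accuracy(D, labels, num_folds):
    """Mean and standard deviation of the 1-NN accuracy across folds."""
    n = len(labels)
    accuracies = []
    for fold in range(num_folds):
        # Assumption: point i belongs to fold i mod num_folds.
        test_idx = [i for i in range(n) if i % num_folds == fold]
        train_idx = [i for i in range(n) if i % num_folds != fold]
        predicted = one_nn_predict(D, labels, train_idx, test_idx)
        correct = sum(p == labels[i] for p, i in zip(predicted, test_idx))
        accuracies.append(correct / len(test_idx))
    return statistics.mean(accuracies), statistics.pstdev(accuracies)
```

Because the classifier only needs the matrix D, the same code applies to syntactic edit distances and to dynamic time warping dissimilarities on traces.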


The results are shown in Table 2. For the palindrome dataset, 
the accuracy for the trace representation is more than 10
percentage points higher compared to the syntactic representation. Yet, likely
due to the small sample size, this difference is not significant 
(Wilcoxon rank-sum test). In case of the sorting data set, 
Figure 3: The sorting dataset embedded in 2 dimensions via t-distributed stochastic neighbor embedding (t-SNE)
[19]. The sorting algorithms are indicated by color. On the left side, the embedding is shown for adapted,
syntactic edit distances [13]. On the right side, we show the embedding for dynamic time warping dissimilarities
on traces.


          palindromes          sorting
method    acc.    std. dev.    acc.    std. dev.
syntax    0.875   0.158        0.812   0.068
traces    0.979   0.051        0.954   0.040


Table 2: The mean classification accuracy and its 
standard deviation of a 1-nearest neighbor classi- 
fier distinguishing six different sorting algorithms. 
Mean and standard deviation are calculated across 
6 (for palindromes) and 10 (for sorting) crossvalida- 
tion trials. 


we gain an increase in accuracy of more than 14 percentage points, which
is highly significant (p < 0.01, Wilcoxon rank-sum test). 
This is also reflected in the corresponding dissimilarities. 
In Figure 3 we show 2-dimensional embeddings of the sort- 
ing dataset according to syntax-based (left) and trace-based 
(right) dissimilarities. The trace representation yields more 
compact clusters corresponding to the correct class label, 
thereby making classification easier. Interestingly, closer in- 
spection of the misclassified data points for the trace rep- 
resentation revealed that the 1-nearest neighbor classifier 
correctly identified a BubbleSort implementation the pro- 
grammers had wrongly labelled as an InsertionSort. 


In order to apply a classification algorithm in practice, labelled
data is required. To reduce human work, one would like to 
reduce the amount of labelled data necessary. We tested the 
required amount of labelled data experimentally, by reduc- 
ing the number of labelled data points (and increasing the 
number of unlabelled points). The results are displayed in 
Figure 4. For the palindrome data set, only two data points 
per class are sufficient to achieve good performance. For 
the sorting data set, about 40 labelled programs suffice to 
achieve a classification accuracy of 90% using the trace rep- 
resentation, while the classification accuracy for the syntac- 
tic representation saturates at 80% for about 60 programs. 


5.3 Classifying Erroneous Programs 
We phrase the identification of erroneous programs as a
classification task as well: We assume that a human expert
Figure 4: The classification accuracy on the strategy
classification task using the syntactic as well as
the trace-based data representation if the number 
of available labelled data points is reduced and the 
number of unlabelled points is increased. The upper 
plot displays the result for the palindromes dataset, 
the lower plot for the sorting dataset. The error- 
bars mark the standard deviation across 6 and 10 
crossvalidation trials respectively. 
method    accuracy   std. dev.
syntax    0.211      0.107
traces    0.861      0.086


Table 3: The mean classification accuracy and its 
standard deviation of a 1-nearest neighbor classifier 
distinguishing erroneous from correct sorting pro- 
grams. Mean and standard deviation are calculated 
across 20 crossvalidation trials. 
Figure 6: The classification accuracy on the error
classification task using the syntactic as well as the 
trace-based data representation if the number of 
available labelled data points is reduced and the 
number of unlabelled points is increased. The error- 
bars mark the standard deviation across 20 crossval- 
idation trials. 


has labelled a few example programs as correct and erro- 
neous respectively. Then, we want to infer the label for new 
programs. We evaluate the classification accuracy of a
1-nearest neighbor classifier in a 20-fold crossvalidation.


The results are shown in Table 3. As expected, the syntactic
information is not at all sufficient to judge the correctness
of a program. The trace-based representation, on the other
hand, distinguishes correct from erroneous solutions in most cases
(about 86% accuracy). Again, we can observe the difference
between both representations in 2-dimensional embeddings.
Figure 5 shows embeddings for the syntax-based (left) as
well as the trace-based dissimilarities (right). While erroneous
and correct solutions are almost indistinguishable for
the former representation, we observe a much clearer
separation of the classes for the latter representation.


We also tested the classification performance if less labelled
data is available (see Figure 6). Interestingly, the
classification accuracy of the syntactic representation decreases if
more labelled data is available. This is likely due to the fact
that we created the erroneous programs based on the correct
ones: from a syntactic point of view, the nearest neighbor
of a program often is its respective counterpart solution,
which carries the opposite label, so errors become more
prevalent the more such neighbors are available for
classification (also refer to Figure 5). Conversely, the trace
representation steadily increases in performance and reaches
80% accuracy at about 50 labelled data points.
5.4 Detecting Error Locations

As a final challenge, we try to locate the errors within the 
erroneous programs. More precisely, the challenge is to iden- 
tify a set of lines of code in an erroneous program, such that 
all errors are included, but few other lines are included. Such 
a set of lines can then be utilized in an intelligent tutoring 
system. The identified lines can be highlighted such that 
the student is able to find the error in her program. We 
apply two strategies based on alignment algorithms, one on 
the syntactic representation and one on the trace represen- 
tation. 


Syntax-Based Error Detection. We select the nearest cor- 
rect neighbor and retrieve a syntactic alignment of the er- 
roneous program and the correct program via backtracing. 
Thereby we obtain the contribution of each line of code in 
the erroneous program to the overall alignment distance. In 
order to identify contributing neighbors as well, we apply 
Gaussian blur to this distribution and then select the line of 
code with the highest contribution as well as its neighbors, 
if their contribution is sufficiently high (at least half as high 
compared to the maximum). 
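
The selection step can be sketched as follows. The per-line contributions are assumed to be given as a list indexed by (zero-based) line number. The kernel width `sigma`, the kernel `radius`, the renormalization at the borders and the function names are assumptions for illustration, as the text only fixes the blur itself and the half-maximum threshold.

```python
import math


def gaussian_blur(values, sigma=1.0, radius=2):
    """Smooth a list of values with a truncated, renormalized Gaussian kernel."""
    blurred = []
    for i in range(len(values)):
        total, weight_sum = 0.0, 0.0
        for k in range(-radius, radius + 1):
            if 0 <= i + k < len(values):
                w = math.exp(-(k * k) / (2.0 * sigma * sigma))
                total += w * values[i + k]
                weight_sum += w
        blurred.append(total / weight_sum)
    return blurred


def select_lines(contributions, sigma=1.0, radius=2):
    """Select the peak line and its neighbors with at least half its value.

    Expects a non-empty list of per-line contributions.
    """
    blurred = gaussian_blur(contributions, sigma, radius)
    peak = max(range(len(blurred)), key=blurred.__getitem__)
    threshold = 0.5 * blurred[peak]
    lo = peak
    while lo > 0 and blurred[lo - 1] >= threshold:
        lo -= 1
    hi = peak
    while hi < len(blurred) - 1 and blurred[hi + 1] >= threshold:
        hi += 1
    return list(range(lo, hi + 1))
```

With these settings, contributions concentrated on two adjacent lines select exactly those two lines, while a broader bump such as [0, 4, 6, 4, 0] selects the three central lines.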


Trace-Based Error Detection. Our trace-based strategy 
is similar to the syntax-based one. We again select the near- 
est correct neighbor and retrieve a trace alignment of the 
erroneous program and the correct program via backtrac- 
ing. However, we can apply additional domain knowledge. 
We assume that an erroneous program has the wrong out- 
put given the input. The output of the program includes 
the value of the relevant variables at the end of the trace. 
Therefore, we can start from the end of the trace alignment 
and work back until the state of the relevant variables is 
equal to the state in the correct program. This is the point 
where the error in the program influences the program’s
execution negatively. However, it is not sufficient to highlight
this particular line of code, because the actual error might 
be earlier in the code (e.g. a wrongly set index). Therefore, 
we select not only this line, but the most frequently exe- 
cuted five lines of code until the last change of the relevant 
variables. 
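
One possible reading of this strategy is sketched below; it is an interpretation, not the implementation evaluated in this paper. The erroneous trace is assumed to be given as two parallel lists (the executed line number and the state of the relevant variables per step), together with the states of the reference trace and an alignment path of zero-based index pairs. All names are hypothetical, and the window from the first diverging step to the last change of the relevant variables is an assumption.

```python
from collections import Counter


def locate_error_lines(lines, states, ref_states, path, k=5):
    """Return up to k candidate lines for the error in an erroneous program."""
    # Walk back from the end of the alignment to the last agreeing pair.
    first_bad = 0
    for i, j in reversed(path):
        if states[i] == ref_states[j]:
            first_bad = i + 1
            break
    # Find the last step at which the relevant variables changed.
    last_change = len(states) - 1
    while last_change > first_bad and states[last_change] == states[last_change - 1]:
        last_change -= 1
    # Rank the lines executed in this window by how often they occur.
    counts = Counter(lines[first_bad:last_change + 1])
    return [line for line, _ in counts.most_common(k)]
```

If the final states of both traces agree, the window is empty and no line is reported, which matches the assumption that an erroneous program produces a wrong output.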


Further, we included three trivial baseline strategies for com- 
parison: 1.) Selecting a line of code at random, 2.) selecting 
a line of code at random according to its distribution in the 
trace, and 3.) selecting all lines in the program that occurred
in the trace. 


We evaluated all five strategies in a 20-fold crossvalidation.
For each erroneous program, we excluded the correct coun- 
terpart from the available neighbors in order to make the 
scenario more realistic. 


The results are shown in Table 4. We report the classic
pattern recognition measures precision (how many of the
selected lines of code contain an error?), recall (how many of
the erroneous lines of code have been selected?) and F1-score
(harmonic mean of precision and recall). In terms of
F1-score, the trace-based error detection method clearly
outperforms the syntax-based one (p < 10~*, Wilcoxon rank-sum
test). Further, as expected, both random baseline methods
seldom select an erroneous line, thereby limiting the
recall. However, selecting all lines of code occurring in a trace
provides a strong baseline to compete with (F1 = 0.186).
Still, the trace-based error location method performs
significantly better (p < 0.01, Wilcoxon rank-sum test).


Figure 5: The sorting dataset including erroneous solutions embedded in 2 dimensions via t-distributed stochastic
neighbor embedding (t-SNE) [19]. The correctness of each program is indicated by color. On the left side,
the embedding is shown for adapted, syntactic edit distances [13]. On the right side, we show the embedding
for dynamic time warping dissimilarities on traces.


method          precision  std. dev.  recall  std. dev.  F1-score  std. dev.
traces          0.183      0.071      0.520   0.211      0.269     0.104
syntax          0.103      0.086      0.134   0.100      0.115     0.091
traces_random   0.157      0.122      0.119   0.098      0.134     0.107
syntax_random   0.121      0.116      0.095   0.095      0.105     0.103
traces_all      0.103      0.022      0.976   0.050      0.186     0.037


Table 4: The mean precision, recall and F1-score with their standard deviations for the five error
localization strategies on the sorting dataset. Mean and standard deviation are calculated across 20
crossvalidation trials.
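
For reference, the three measures can be computed per program as in the following Python sketch, where the selected and the truly erroneous lines are given as collections of line numbers. The function name and the convention of returning zero for empty sets are assumptions.

```python
def precision_recall_f1(selected, erroneous):
    """Precision, recall and F1-score of selected lines versus erroneous lines."""
    selected, erroneous = set(selected), set(erroneous)
    hits = len(selected & erroneous)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(erroneous) if erroneous else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Selecting five lines of which one is the single erroneous line gives a precision of 0.2, a recall of 1.0 and an F1-score of 1/3. As a rough plausibility check, the harmonic mean of the mean precision (0.103) and mean recall (0.976) of the all-lines baseline is about 0.186, in line with its reported F1-score, although a mean of per-trial F1-scores need not equal the F1-score of the means in general.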


6. DISCUSSION 


In this contribution we introduced an alternative representation
of computer programs for classification and error detection
in intelligent tutoring systems (ITSs), namely execution
traces. On two example data sets we have demonstrated
that this representation can improve upon state-of-the-art 
syntax-based representation in terms of strategy classifica- 
tion, error classification and error detection. In a full-blown 
ITS for computer programming, the trace representation can 
thus be applied to help students in solving programming 
tasks. As soon as a student has managed to reach a working
state (without syntax errors and program crashes) we can 
generate a trace and compare it with the traces of differ- 
ent programs. The resulting (dis-)similarity measure can be 
used to identify possible partners for peer-review and peer- 
tutoring by matching students that apply the same approach 
in their solution attempt. Further, the trace representation 
can be applied to identify erroneous programs, enabling an 
ITS to detect whether a student has finished a task or still 
needs to continue. Further, as not only the end result is 
checked but the whole execution, the trace representation 
can be utilized for detecting unusual or deceptive solutions 
that are geared towards the test cases without actually im- 
plementing the desired function. Finally, if an error is still 
present in a student’s program but the error is not obvious,
the trace representation may help to identify and highlight 
the location of the error in the program code, thereby pro- 
viding scaffolding to students that get stuck in searching for 
their error. 


Overall, the trace representation appears to be highly useful 
for data-driven ITSs on computer programming. However, 
important challenges remain. If no a priori knowledge re- 
garding the relevant variables in the program is available, 
computing a dissimilarity on variable states is not trivial. 
We have suggested a first attempt using a histogram of vari- 
able types. This representation, however, disregards the 
content of variables and thus is likely not sufficiently power- 
ful in many applications where differences in variable values 
are important markers of program semantics. A solution 
might be to match variables probabilistically according to 
the alignment distance a certain matching produces. This is 
an interesting direction to pursue in further research. 


Finally, we note that the trace representation does not have 
to be the sole source of information for an ITS. A syntactic 
representation is necessary when a program does not yet 
compile or crashes and wherever the high level of abstraction 
applied by a program trace is not helpful (e.g. when teaching 
certain syntactic constructs). Fusing the strengths of both 
representations is likely to lead to the best learning outcomes 
for students. 